Life Expectancy Factors and Prediction
Our study will focus on various factors affecting life expectancy considering demographic variables, income composition and mortality rates. Life expectancy dataset made available to public for the purpose of health data analysis. This dataset is related to life expectancy, health factors from year 2000-2015 for 193 countries has been collected from comes Global Health Observatory (GHO) data repository under World Health Organization. The dataset contains 2938 records and 22 columns. For our project dataset is sourced from Kaggle: Kaggle Dataset Source Link
Research: Our team like to apply the knowledge gained in STAT420 course to understand how data analysis techniques and linear regression model can guide us to build best possible model and predict within a reasonable amount of confidence the life expectancy of different countries around the world. This will eventually help government officials adopt the necessary steps to help increase the life expectancy.
Inspiration and Personal interest: We would like to explore how statistics combined with technology can make difference in real-world use cases impacting every individual, provides guidelines to government policies and encourage them take informed decisions in time. We are planning to use different predictors like BMI, alcohol intake, HIV/aids and many more. Our aim is through statistics to help determining the predicting factors that are contributing to lower value of life expectancy. Explainable model clearly identify predictors contributing to life expectancy and by improving certain predictors (for example improving the polio vaccine intake or reducing alcohol) life expectancy can be improved.
In the following dataset, Life expectancy is the response continuous variable, Country and Status are the categorical explanatory variables and rest are either continuous numeric or numeric explanatory variables.
| Variables | Type | Description |
|---|---|---|
| Country | character | Country |
| Year | numerical | Year |
| Status | character | Developed or Developing status |
| Life expectancy | numerical | Life Expectancy in age |
| Adult Mortality | numerical | Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population) |
| Infant deaths | numerical | Number of Infant Deaths per 1000 population |
| Alcohol | numerical | Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol) |
| Percentage expenditure | numerical | Expenditure on health as a percentage of Gross Domestic Product per capita(%) |
| Hepatitis B | numerical | Hepatitis B (HepB) immunization coverage among 1-year-olds (%) |
| Measles | numerical | Measles - number of reported cases per 1000 population |
| BMI | numerical | Average Body Mass Index of entire population |
| Under-five deaths | numerical | Number of under-five deaths per 1000 population |
| Polio | numerical | Polio (Pol3) immunization coverage among 1-year-olds (%) |
| Total expenditure | numerical | General government expenditure on health as a percentage of total government expenditure (%) |
| Diphtheria | numerical | Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%) |
| HIV/AIDS | numerical | Deaths per 1 000 live births HIV/AIDS (0-4 years) |
| GDP | numerical | Gross Domestic Product per capita (in USD) |
| Population | numerical | Population of the country |
| Thinness 1-19 yrs | numerical | Prevalence of thinness among children and adolescents for Age 10 to 19 (% ) |
| Thinness 5-9 yrs | numerical | Prevalence of thinness among children for Age 5 to 9(%) |
| Income comp. of resources | numerical | Human Development Index in terms of income composition of resources (index ranging from 0 to 1) |
| Schooling | numerical | Number of years of Schooling(years) |
The goal of the model is to identify what are the key contributing factors towards life expectancy and to accurately predict the life expectancy.
Now we investigate the dataset to see what actions we need to take with the dataset before creating a model
The data source format is csv and can be read into R, below are the sample records and variable data types. Since the output is very long, we use echo=FALSE, but as a sample, we show the output below for the first few rows
| Country | Year | Status | Life.expectancy | Adult.Mortality | infant.deaths | Alcohol | percentage.expenditure | Hepatitis.B | Measles | BMI | under.five.deaths | Polio | Total.expenditure | Diphtheria | HIV.AIDS | GDP | Population | thinness..1.19.years | thinness.5.9.years | Income.composition.of.resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 2015 | Developing | 65.0 | 263 | 62 | 0.01 | 71.27962 | 65 | 1154 | 19.1 | 83 | 6 | 8.16 | 65 | 0.1 | 584.2592 | 33736494 | 17.2 | 17.3 | 0.479 | 10.1 |
| Afghanistan | 2014 | Developing | 59.9 | 271 | 64 | 0.01 | 73.52358 | 62 | 492 | 18.6 | 86 | 58 | 8.18 | 62 | 0.1 | 612.6965 | 327582 | 17.5 | 17.5 | 0.476 | 10.0 |
| Afghanistan | 2013 | Developing | 59.9 | 268 | 66 | 0.01 | 73.21924 | 64 | 430 | 18.1 | 89 | 62 | 8.13 | 64 | 0.1 | 631.7450 | 31731688 | 17.7 | 17.7 | 0.470 | 9.9 |
| Afghanistan | 2012 | Developing | 59.5 | 272 | 69 | 0.01 | 78.18421 | 67 | 2787 | 17.6 | 93 | 67 | 8.52 | 67 | 0.1 | 669.9590 | 3696958 | 17.9 | 18.0 | 0.463 | 9.8 |
| Afghanistan | 2011 | Developing | 59.2 | 275 | 71 | 0.01 | 7.09711 | 68 | 3013 | 17.2 | 97 | 68 | 7.87 | 68 | 0.1 | 63.5372 | 2978599 | 18.2 | 18.2 | 0.454 | 9.5 |
| Afghanistan | 2010 | Developing | 58.8 | 279 | 74 | 0.01 | 79.67937 | 66 | 1989 | 16.7 | 102 | 66 | 9.20 | 66 | 0.1 | 553.3289 | 2883167 | 18.4 | 18.4 | 0.448 | 9.2 |
Rename these columns
Most of the columns are not one word so we will rename them to one word for ease of programming
names(life_data)[c(4,5,6,8,9,12,14,16,19,20,21)] = c("Life_expectancy","Adult_Mortality","Infant_deaths",
"Percentage_expenditure","Hepatitis_B","Under_5_deaths",
"Total_expenditure","HIV_AIDS","Thinness_1_to_19_yrs","Thinness_5_to_9_yrs",
"Income_composition_of_resources")
Remove these columns
life_data = subset(life_data, select = -c(`Country`))
Other Observations
Status column is a factor variable with two levels - Developed and Developing
is.factor(life_data$Status)
## [1] FALSE
life_data$Status = as.factor(life_data$Status)
levels(life_data$Status)
## [1] "Developed" "Developing"
# Box plot
boxplot(Life_expectancy ~ Status, data = life_data,
xlab = "Status",
ylab = "Life expectancy",
main = "Life expectancy vs Status",
pch = 20,
cex = 2,
col = "darkorange",
border = "dodgerblue"
)
We can observe the following two points from the box plot
* Mean of Life Expectancy of a developed nation is higher than the mean of a developing nation
* Variance in Life Expectancy of a developing nation is higher than that of a developed nation
There is some missing data. We will do a quick summary on how many values is each variable missing
v = sort(colSums(is.na(life_data)),decreasing = TRUE)
knitr::kable(v, col.names = c("NA Count"),align = "ll", caption = "Variables with missing value count")
| NA Count | |
|---|---|
| Population | 652 |
| Hepatitis_B | 553 |
| GDP | 448 |
| Total_expenditure | 226 |
| Alcohol | 194 |
| Income_composition_of_resources | 167 |
| Schooling | 163 |
| BMI | 34 |
| Thinness_1_to_19_yrs | 34 |
| Thinness_5_to_9_yrs | 34 |
| Polio | 19 |
| Diphtheria | 19 |
| Life_expectancy | 10 |
| Adult_Mortality | 10 |
| Year | 0 |
| Status | 0 |
| Infant_deaths | 0 |
| Percentage_expenditure | 0 |
| Measles | 0 |
| Under_5_deaths | 0 |
| HIV_AIDS | 0 |
#remove missing observations
life_data = na.omit(life_data)
Now, to identify columns that have high correlations. First we plot the correlation matrix
corrplot(cor(life_data[,3:21]))
We define high correlation as great than 70% correlation.
In order to identify correlation we first create a dataframe that has only numeric predictors
life_data_with_char_cols_removed = subset(life_data, select = -c(`Status`))
Now we identify the correlation in numeric predictors. We mark all correlations less than .70 as NA, so that we can easily identify highly correlated features
For brevity of the report, we only show the head of the correlation matrix, while we had looked at the entire matrix to come to our conclusions below
cor_relation = cor(life_data_with_char_cols_removed, use = "complete.obs")
cor_relation[abs(cor_relation) < 0.80] <- NA
head(cor_relation,20)
## Year Life_expectancy Adult_Mortality
## Year 1 NA NA
## Life_expectancy NA 1 NA
## Adult_Mortality NA NA 1
## Infant_deaths NA NA NA
## Alcohol NA NA NA
## Percentage_expenditure NA NA NA
## Hepatitis_B NA NA NA
## Measles NA NA NA
## BMI NA NA NA
## Under_5_deaths NA NA NA
## Polio NA NA NA
## Total_expenditure NA NA NA
## Diphtheria NA NA NA
## HIV_AIDS NA NA NA
## GDP NA NA NA
## Population NA NA NA
## Thinness_1_to_19_yrs NA NA NA
## Thinness_5_to_9_yrs NA NA NA
## Income_composition_of_resources NA NA NA
## Schooling NA NA NA
## Infant_deaths Alcohol Percentage_expenditure
## Year NA NA NA
## Life_expectancy NA NA NA
## Adult_Mortality NA NA NA
## Infant_deaths 1.000000 NA NA
## Alcohol NA 1 NA
## Percentage_expenditure NA NA 1.000000
## Hepatitis_B NA NA NA
## Measles NA NA NA
## BMI NA NA NA
## Under_5_deaths 0.996906 NA NA
## Polio NA NA NA
## Total_expenditure NA NA NA
## Diphtheria NA NA NA
## HIV_AIDS NA NA NA
## GDP NA NA 0.959299
## Population NA NA NA
## Thinness_1_to_19_yrs NA NA NA
## Thinness_5_to_9_yrs NA NA NA
## Income_composition_of_resources NA NA NA
## Schooling NA NA NA
## Hepatitis_B Measles BMI Under_5_deaths Polio
## Year NA NA NA NA NA
## Life_expectancy NA NA NA NA NA
## Adult_Mortality NA NA NA NA NA
## Infant_deaths NA NA NA 0.996906 NA
## Alcohol NA NA NA NA NA
## Percentage_expenditure NA NA NA NA NA
## Hepatitis_B 1 NA NA NA NA
## Measles NA 1 NA NA NA
## BMI NA NA 1 NA NA
## Under_5_deaths NA NA NA 1.000000 NA
## Polio NA NA NA NA 1
## Total_expenditure NA NA NA NA NA
## Diphtheria NA NA NA NA NA
## HIV_AIDS NA NA NA NA NA
## GDP NA NA NA NA NA
## Population NA NA NA NA NA
## Thinness_1_to_19_yrs NA NA NA NA NA
## Thinness_5_to_9_yrs NA NA NA NA NA
## Income_composition_of_resources NA NA NA NA NA
## Schooling NA NA NA NA NA
## Total_expenditure Diphtheria HIV_AIDS GDP
## Year NA NA NA NA
## Life_expectancy NA NA NA NA
## Adult_Mortality NA NA NA NA
## Infant_deaths NA NA NA NA
## Alcohol NA NA NA NA
## Percentage_expenditure NA NA NA 0.959299
## Hepatitis_B NA NA NA NA
## Measles NA NA NA NA
## BMI NA NA NA NA
## Under_5_deaths NA NA NA NA
## Polio NA NA NA NA
## Total_expenditure 1 NA NA NA
## Diphtheria NA 1 NA NA
## HIV_AIDS NA NA 1 NA
## GDP NA NA NA 1.000000
## Population NA NA NA NA
## Thinness_1_to_19_yrs NA NA NA NA
## Thinness_5_to_9_yrs NA NA NA NA
## Income_composition_of_resources NA NA NA NA
## Schooling NA NA NA NA
## Population Thinness_1_to_19_yrs
## Year NA NA
## Life_expectancy NA NA
## Adult_Mortality NA NA
## Infant_deaths NA NA
## Alcohol NA NA
## Percentage_expenditure NA NA
## Hepatitis_B NA NA
## Measles NA NA
## BMI NA NA
## Under_5_deaths NA NA
## Polio NA NA
## Total_expenditure NA NA
## Diphtheria NA NA
## HIV_AIDS NA NA
## GDP NA NA
## Population 1 NA
## Thinness_1_to_19_yrs NA 1.000000
## Thinness_5_to_9_yrs NA 0.927913
## Income_composition_of_resources NA NA
## Schooling NA NA
## Thinness_5_to_9_yrs
## Year NA
## Life_expectancy NA
## Adult_Mortality NA
## Infant_deaths NA
## Alcohol NA
## Percentage_expenditure NA
## Hepatitis_B NA
## Measles NA
## BMI NA
## Under_5_deaths NA
## Polio NA
## Total_expenditure NA
## Diphtheria NA
## HIV_AIDS NA
## GDP NA
## Population NA
## Thinness_1_to_19_yrs 0.927913
## Thinness_5_to_9_yrs 1.000000
## Income_composition_of_resources NA
## Schooling NA
## Income_composition_of_resources Schooling
## Year NA NA
## Life_expectancy NA NA
## Adult_Mortality NA NA
## Infant_deaths NA NA
## Alcohol NA NA
## Percentage_expenditure NA NA
## Hepatitis_B NA NA
## Measles NA NA
## BMI NA NA
## Under_5_deaths NA NA
## Polio NA NA
## Total_expenditure NA NA
## Diphtheria NA NA
## HIV_AIDS NA NA
## GDP NA NA
## Population NA NA
## Thinness_1_to_19_yrs NA NA
## Thinness_5_to_9_yrs NA NA
## Income_composition_of_resources 1 NA
## Schooling NA 1
Based on the above correlation matrix and corroborated by the correlation plot , we make the below observations
Remove the following predictors
life_data_clean = subset(life_data, select = -c(`Infant_deaths`, `GDP`, `Thinness_5_to_9_yrs`))
dim(life_data_clean)
## [1] 1649 18
We know that step wont work with missing values. Hence we will need to remove those.
We started with 1649 observations and 21 columns and after our initial cleanup we end up with 1649 observations and 18 columns
Now, before we begin modeling, we look at the pairs plots to see if any of the parameters are an obvious choice for transformations
life_data_with_char_cols_removed = subset(life_data_clean, select = -c(`Status`))
pairs(life_data_with_char_cols_removed , col = "dodgerblue")
We make the following observations from the plot
Adult Mortality and HIV AIDS.Income comp. of resources and Schooling.Polio and Diphtheria.Potential for transformations
From pair plot we definitely see need for some transformation specially with Adult Mortality and Income comp. of resources. We will get exact tranformation format with boxcox procedure.
Diagnostics Function
Diagnostics function for Fitted versus Residuals & Normal Q-Q plots and testing model against Shapiro-Wilk & Breusch-Pagan test
diagnostics = function(model, pcol = "grey", lcol = "dodgerblue", alpha = 0.01,
plotit = TRUE, testit = TRUE) {
if(plotit == TRUE) {
par(mfrow=c(1,2))
plot(fitted(model), resid(model), col = pcol, pch = 20,
xlab = "Fitted", ylab = "Residuals.",
main = "Fitted versus Residuals")
abline(h = 0, col = lcol, lwd = 2)
qqnorm(resid(model), main = "Normal Q-Q Plot", col = pcol)
qqline(resid(model), col = lcol, lwd = 2)
}
if (testit == TRUE) {
normality_p_val = shapiro.test(resid(model))$p.value
normality_decision = ifelse(normality_p_val < alpha, "Reject", "Fail to Reject.")
(bptest_pvalue = bptest(model)$"p.value"[[1]])
const_variance_decision = ifelse(bptest_pvalue < alpha, "Reject", "Fail to Reject.")
rmse = round(sqrt(mean(resid(model) ^ 2)), 4)
aic = round(extractAIC(model)[2], 2)
num_predictors = length(coef(model)) - 1
res = list(num_predictors = num_predictors, sw_pvalue = normality_p_val, sw_decision = normality_decision,
bp_pvalue = bptest_pvalue, bp_decision = const_variance_decision, RMSE = rmse,
AdjustedR2 = round(summary(model)$"adj.r.squared", 5), AIC=aic)
return(res)
}
}
In order to validate that transformations are necessary we will start with a simple additive model and look at its diagnostics plots
model_1 = lm( Life_expectancy ~ . , data=life_data_clean)
knitr::kable(data.frame(diagnostics(model_1)))
| num_predictors | sw_pvalue | sw_decision | bp_pvalue | bp_decision | RMSE | AdjustedR2 | AIC |
|---|---|---|---|---|---|---|---|
| 17 | 0 | Reject | 0 | Reject | 3.6097 | 0.82976 | 4269.41 |
Based on the diagnostics and the plots we see that some kind of transformation for the response is necessary.
In order to figure out the transformation for the response, we find the lambda for it
bc = boxcox(model_1)
We know that the most common Box-Cox Transformations are
| \(\lambda\) | Transformed Data |
|---|---|
| -2 | \(y^{-2}\) |
| -1 | \(y^{-1}\) |
| -.5 | \(1 \over \sqrt y\) |
| 0 | ln(y) |
| .5 | \(\sqrt y\) |
| 1 | y |
| 2 | \(y^2\) |
since our \(\lambda\) is close to 2 we will do \(y^2\) transformations
We redo the model and look at the diagnostics plots again
Apply transformation and test model
#fit new linear regression model using the Box-Cox transformation
new_model = lm(Life_expectancy ^ 2 ~ ., data=life_data_clean)
knitr::kable(data.frame(diagnostics(new_model)))
| num_predictors | sw_pvalue | sw_decision | bp_pvalue | bp_decision | RMSE | AdjustedR2 | AIC |
|---|---|---|---|---|---|---|---|
| 17 | 3e-06 | Reject | 0 | Reject | 495.775 | 0.81997 | 20503.8 |
We see that the plots and the diagnostics are a lot better, but there seems to be some scope for improvement.
Let us now identify the lambda transformations for the other columns we identified and using those variables as response, fit the model, but keep Life_expectancy ^ 2 in the predictor with others
m3 = lm(Adult_Mortality ~ . - Life_expectancy + Life_expectancy ^ 2, data = life_data_clean)
boxcox(m3,xlab = "lambda for Adult_Mortality")
We should apply \(\sqrt y\) transformation to Adult_Mortality since \(\lambda\) is close to 0.5
m3 = lm(Alcohol ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Alcohol")
We should apply \(\sqrt y\) transformation to Alcohol since \(\lambda\) is close to 0.5
m3 = lm(Hepatitis_B ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Hepatitis_B")
We should apply \(y^2\) transformation to Hepatitis_B since \(\lambda\) is close to 2
m3 = lm(HIV_AIDS ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for HIV_AIDS")
We should apply 1/\(\sqrt HIV AIDS\) transformation to HIV_AIDS since \(\lambda\) is close to -0.5
m3 = lm(Diphtheria ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Diphtheria")
We should apply \(y^2\) transformation to Diphtheria since \(\lambda\) is close to 2
m3 = lm(Total_expenditure ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Total_expenditure")
We should apply no transformation to Total_expenditure since \(\lambda\) is close to 1.0
m3 = lm(Population ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Population")
We should apply log transformation to Population since \(\lambda\) is close to 0
m3 = lm(Polio ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Polio")
We should apply \(y^2\) transformation to Polio since \(\lambda\) is close to 2
m3 = lm(Thinness_1_to_19_yrs ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda Thinness_1_to_19_yrs ")
We should apply log transformation to Thinness_1_to_19_yrs since \(\lambda\) is close to 0
m3 = lm(BMI ~ . - Life_expectancy + Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for BMI")
No transformation needed for BMI since \(\lambda\) is close to 1
Based on the above analysis we create the below models to start with
model_additive = lm (Life_expectancy ~ ., data = life_data_clean)
model_all = lm(Life_expectancy ~ ., data = life_data_clean)
all_models = summary(regsubsets(Life_expectancy ~ ., data = life_data_clean))
#all_models$which
all_models$adjr2
## [1] 0.529160 0.730063 0.786707 0.808711 0.814148 0.819397 0.822532 0.825200
The above show the adjusted R2 of all the models selected by the exhaustive search
(best_r2_ind = which.max(all_models$adjr2))
## [1] 8
Selected the best model
all_models$which[best_r2_ind,]
## (Intercept) Year
## TRUE TRUE
## StatusDeveloping Adult_Mortality
## FALSE TRUE
## Alcohol Percentage_expenditure
## FALSE TRUE
## Hepatitis_B Measles
## FALSE FALSE
## BMI Under_5_deaths
## TRUE FALSE
## Polio Total_expenditure
## FALSE FALSE
## Diphtheria HIV_AIDS
## TRUE TRUE
## Population Thinness_1_to_19_yrs
## FALSE FALSE
## Income_composition_of_resources Schooling
## TRUE TRUE
#p = length(coef(model_all))
p = 9
n = length(resid(model_all))
le_model_aic = n * log(all_models$rss/n) + 2 * (2:p)
(best_aic_ind = which.min(le_model_aic))
## [1] 8
all_models$which[best_aic_ind,]
## (Intercept) Year
## TRUE TRUE
## StatusDeveloping Adult_Mortality
## FALSE TRUE
## Alcohol Percentage_expenditure
## FALSE TRUE
## Hepatitis_B Measles
## FALSE FALSE
## BMI Under_5_deaths
## TRUE FALSE
## Polio Total_expenditure
## FALSE FALSE
## Diphtheria HIV_AIDS
## TRUE TRUE
## Population Thinness_1_to_19_yrs
## FALSE FALSE
## Income_composition_of_resources Schooling
## TRUE TRUE
#store to use for later
le_model_best_aic = lm (Life_expectancy ~ Year +
+ Adult_Mortality + Percentage_expenditure + BMI + Diphtheria + HIV_AIDS+
+ Schooling + Income_composition_of_resources, data = life_data_clean)
# model complexity
plot( le_model_aic ~ I(2:p),
xlab = "p = Number of parameters",
ylab = "AIC",
main = "AIC vs Model complexity",
pch = 20,
cex = 2,
col = "darkorange",
type = "b"
)
#backward aic
le_model_back_aic = step(model_additive, trace = 0)
coef(le_model_back_aic)
## (Intercept) Year
## 3.44567e+02 -1.45521e-01
## StatusDeveloping Adult_Mortality
## -8.94369e-01 -1.72102e-02
## Alcohol Percentage_expenditure
## -1.73065e-01 4.49088e-04
## Measles BMI
## 1.55660e-05 3.57100e-02
## Under_5_deaths Total_expenditure
## -3.11513e-03 8.97790e-02
## Diphtheria HIV_AIDS
## 2.05685e-02 -4.52247e-01
## Population Income_composition_of_resources
## 2.89972e-09 1.11996e+01
## Schooling
## 9.32685e-01
#forward aic
le_model_fwd_aic_start = lm (Life_expectancy ~ 1, data = life_data_clean)
le_model_fwd_aic = step(le_model_fwd_aic_start,
scope =Life_expectancy ~
Year
+ Status
+ Income_composition_of_resources
+ Adult_Mortality
+ Total_expenditure
+ Alcohol
+ Percentage_expenditure
+ Hepatitis_B
+ Measles
+ BMI
+ Under_5_deaths
+ Polio
+ HIV_AIDS
+ Diphtheria
+ Thinness_1_to_19_yrs
+ Schooling
,
direction = "forward", trace = 0)
le_model_bic = n * log(all_models$rss/n) + log(n) * (2:p)
(best_bic_ind = which.min(le_model_bic))
## [1] 8
all_models$which[best_bic_ind,]
## (Intercept) Year
## TRUE TRUE
## StatusDeveloping Adult_Mortality
## FALSE TRUE
## Alcohol Percentage_expenditure
## FALSE TRUE
## Hepatitis_B Measles
## FALSE FALSE
## BMI Under_5_deaths
## TRUE FALSE
## Polio Total_expenditure
## FALSE FALSE
## Diphtheria HIV_AIDS
## TRUE TRUE
## Population Thinness_1_to_19_yrs
## FALSE FALSE
## Income_composition_of_resources Schooling
## TRUE TRUE
#store to use for later
le_model_best_bic = lm (Life_expectancy ~ Year +
+ Adult_Mortality + Percentage_expenditure + Schooling + BMI + HIV_AIDS+
+ Diphtheria + Income_composition_of_resources, data = life_data_clean)
#backward bic
le_model_back_bic = step(model_additive, trace = 0, k = log(n))
coef(le_model_back_bic)
## (Intercept) Year
## 345.452945985 -0.146367347
## Adult_Mortality Alcohol
## -0.017556785 -0.137593048
## Percentage_expenditure BMI
## 0.000495584 0.036261326
## Under_5_deaths Diphtheria
## -0.001883351 0.021854602
## HIV_AIDS Income_composition_of_resources
## -0.449795341 11.184397773
## Schooling
## 0.961446467
#forward bic
le_model_fwd_bic_start = lm (Life_expectancy ~ 1, data = life_data_clean)
le_model_fwd_bic = step(le_model_fwd_bic_start,
scope =Life_expectancy ~
Year
+ Status
+ Income_composition_of_resources
+ Adult_Mortality
+ Total_expenditure
+ Alcohol
+ Percentage_expenditure
+ Hepatitis_B
+ Measles
+ BMI
+ Under_5_deaths
+ Polio
+ HIV_AIDS
+ Diphtheria
+ Thinness_1_to_19_yrs
,
direction = "forward", k = log(n), trace = 0)
#model_additive_backward_aic = step (model_additive, trace = 0)
model_transform = lm (Life_expectancy ^ 2 ~ Status + sqrt(Alcohol)
+ sqrt(Adult_Mortality) + Income_composition_of_resources ^ 2 + 1/sqrt(Total_expenditure) + log(Population)
+ BMI + 1/sqrt(HIV_AIDS) + Diphtheria ^ 2 + Hepatitis_B ^ 2 + Measles + Percentage_expenditure
+ Polio ^2 + log(Thinness_1_to_19_yrs),
data = life_data_clean)
model_transform_backward_aic = step (model_transform, trace = 0)
Based on playing around with the data and trying various different interaction between variables we came up with following interactive polynomial transformation model.
model_poly = lm(Life_expectancy~sqrt(Adult_Mortality)*(Income_composition_of_resources^2)*Percentage_expenditure*Schooling*BMI+HIV_AIDS+Diphtheria+Polio+Year*Status,data=life_data_clean)
To get the rmse of the transformed model we need to get the fitted values back to the original scale
rmse_transformed = sqrt(mean((life_data_clean$Life_expectancy - sqrt(fitted(model_transform)))^2))
rmse_transform_backward_aic = sqrt(mean((life_data_clean$Life_expectancy - sqrt(fitted(model_transform_backward_aic)))^2))
df_result = rbind(m_additive = diagnostics(model_additive,plotit = FALSE),
#m_additive_step = diagnostics(model_additive_backward_aic,plotit = FALSE),
m_transform_1 = diagnostics(model_transform,plotit = FALSE),
m_transform_1_step = diagnostics(model_transform,plotit = FALSE),
model_min_aic = diagnostics(le_model_best_aic,plotit = FALSE),
model_min_bic = diagnostics(le_model_best_bic,plotit = FALSE),
model_back_aic = diagnostics(le_model_back_aic,plotit = FALSE),
model_back_bic = diagnostics(le_model_back_bic,plotit = FALSE),
model_fwd_aic = diagnostics(le_model_fwd_aic,plotit = FALSE),
model_fwd_bic = diagnostics(le_model_fwd_bic,plotit = FALSE),
model_poly = diagnostics(model_poly,plotit = FALSE)
)
#update the rmse of the transformed models
df_result['m_transform_1',]$RMSE = round(rmse_transformed,4)
df_result['m_transform_1_step',]$RMSE = round(rmse_transform_backward_aic,4)
knitr::kable(df_result, align = "cccccccc", caption = "Model Comparison Summary")
| num_predictors | sw_pvalue | sw_decision | bp_pvalue | bp_decision | RMSE | AdjustedR2 | AIC | |
|---|---|---|---|---|---|---|---|---|
| m_additive | 17 | 5.80987998863635e-08 | Reject | 6.41579265092972e-33 | Reject | 3.6097 | 0.82976 | 4269.41 |
| m_transform_1 | 12 | 5.58980035080847e-15 | Reject | 3.40055212895347e-86 | Reject | 4.8562 | 0.71269 | 21269.62 |
| m_transform_1_step | 12 | 5.58980035080847e-15 | Reject | 3.40055212895347e-86 | Reject | 4.8573 | 0.71269 | 21269.62 |
| model_min_aic | 8 | 6.15431590889256e-10 | Reject | 3.4663551604506e-29 | Reject | 3.6678 | 0.8252 | 4304.07 |
| model_min_bic | 8 | 6.15431590889256e-10 | Reject | 3.46635516045084e-29 | Reject | 3.6678 | 0.8252 | 4304.07 |
| model_back_aic | 14 | 4.22813740389515e-08 | Reject | 1.3205694135825e-34 | Reject | 3.6138 | 0.82969 | 4267.14 |
| model_back_bic | 10 | 2.23394873540412e-08 | Reject | 4.56792467310997e-35 | Reject | 3.6319 | 0.8284 | 4275.64 |
| model_fwd_aic | 13 | 2.64491669882093e-08 | Reject | 1.37799693326893e-34 | Reject | 3.617 | 0.82949 | 4268.02 |
| model_fwd_bic | 10 | 5.30301565996376e-13 | Reject | 1.60142326938387e-81 | Reject | 3.8867 | 0.80348 | 4499.21 |
| model_poly | 37 | 6.78173115056804e-08 | Reject | 6.79157135486358e-17 | Reject | 3.5305 | 0.83513 | 4236.26 |
Looking at the above table, we can clearly see that the polynomial model is yielding a good model. The RMSE is the lowest but it has does not have the lowest predictors. Also its adj R2 is the highest
We now perform Anova tests
First test between le_model_back_aic and model_poly
anova(le_model_back_aic, model_poly)
## Analysis of Variance Table
##
## Model 1: Life_expectancy ~ Year + Status + Adult_Mortality + Alcohol +
## Percentage_expenditure + Measles + BMI + Under_5_deaths +
## Total_expenditure + Diphtheria + HIV_AIDS + Population +
## Income_composition_of_resources + Schooling
## Model 2: Life_expectancy ~ sqrt(Adult_Mortality) * (Income_composition_of_resources^2) *
## Percentage_expenditure * Schooling * BMI + HIV_AIDS + Diphtheria +
## Polio + Year * Status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1634 21535
## 2 1611 20554 23 981 3.343 1.62e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Second between two best models model_additive and model_poly
anova(model_additive, model_poly)
## Analysis of Variance Table
##
## Model 1: Life_expectancy ~ Year + Status + Adult_Mortality + Alcohol +
## Percentage_expenditure + Hepatitis_B + Measles + BMI + Under_5_deaths +
## Polio + Total_expenditure + Diphtheria + HIV_AIDS + Population +
## Thinness_1_to_19_yrs + Income_composition_of_resources +
## Schooling
## Model 2: Life_expectancy ~ sqrt(Adult_Mortality) * (Income_composition_of_resources^2) *
## Percentage_expenditure * Schooling * BMI + HIV_AIDS + Diphtheria +
## Polio + Year * Status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1631 21487
## 2 1611 20554 20 932.3 3.654 9.02e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From both the anova tests its apparent that larger model `model_poly’ is statistically preferred.
In this section we’ll evaluate prediction abilities of the models. First let’s split the data set to training and “test”. 75% for training and 25% for test datasets
trn_idx = sample(nrow(life_data_clean), round(nrow(life_data_clean) * 0.75))
life_data_clean_trn = life_data_clean[trn_idx, ]
life_data_clean_tst = life_data_clean[-trn_idx, ]
Train several found models:
model_additive_pred = lm (Life_expectancy ~ ., data = life_data_clean_trn)
model_transform_pred = lm (Life_expectancy ^ 2 ~ Status + sqrt(Alcohol)
+ sqrt(Adult_Mortality) + Income_composition_of_resources ^ 2 + 1/sqrt(Total_expenditure) + log(Population)
+ BMI + 1/sqrt(HIV_AIDS) + Diphtheria ^ 2 + Hepatitis_B ^ 2 + Measles + Percentage_expenditure
+ Polio ^2 + log(Thinness_1_to_19_yrs),
data = life_data_clean_trn)
model_best_aic_pred = lm (Life_expectancy ~ Year +
+ Adult_Mortality + Percentage_expenditure + BMI + Diphtheria + HIV_AIDS+
+ Schooling + Income_composition_of_resources, data = life_data_clean_trn)
model_best_bic_pred = lm (Life_expectancy ~ Year +
+ Adult_Mortality + Percentage_expenditure + Schooling + BMI + HIV_AIDS+
+ Diphtheria + Income_composition_of_resources, data = life_data_clean_trn)
model_back_aic_pred = step(model_additive_pred, trace = 0)
model_back_bic_pred = step(model_additive_pred, trace = 0, k = log(nrow(life_data_clean_trn)))
model_fwd_aic_pred = step(lm (Life_expectancy ~ 1, data = life_data_clean_trn),
scope =Life_expectancy ~
Year
+ Status
+ Income_composition_of_resources
+ Adult_Mortality
+ Total_expenditure
+ Alcohol
+ Percentage_expenditure
+ Hepatitis_B
+ Measles
+ BMI
+ Under_5_deaths
+ Polio
+ HIV_AIDS
+ Diphtheria
+ Thinness_1_to_19_yrs
+ Schooling
,
direction = "forward", trace = 0)
model_fwd_bic_pred = step(lm (Life_expectancy ~ 1, data = life_data_clean_trn),
scope =Life_expectancy ~
Year
+ Status
+ Income_composition_of_resources
+ Adult_Mortality
+ Total_expenditure
+ Alcohol
+ Percentage_expenditure
+ Hepatitis_B
+ Measles
+ BMI
+ Under_5_deaths
+ Polio
+ HIV_AIDS
+ Diphtheria
+ Thinness_1_to_19_yrs
,
direction = "forward", k = log(nrow(life_data_clean_trn)), trace = 0)
model_poly_pred = lm(Life_expectancy~sqrt(Adult_Mortality)*(Income_composition_of_resources^2)*Percentage_expenditure*Schooling*BMI+HIV_AIDS+Diphtheria+Polio+Year*Status,data=life_data_clean_trn)
model_prediction_data = function(model) {
res = list(
rmse_test = round(sqrt(mean((life_data_clean_tst$Life_expectancy - predict(model, life_data_clean_tst)) ^ 2)), 4),
rsme_train = round(sqrt(mean((life_data_clean_trn$Life_expectancy - predict(model, life_data_clean_trn)) ^ 2)), 4)
)
}
df_result = rbind( additive = model_prediction_data(model_additive_pred),
#transform = model_prediction_data(model_transform_pred),
best_aic = model_prediction_data(model_best_aic_pred),
best_bic = model_prediction_data(model_best_bic_pred),
back_aic = model_prediction_data(model_back_aic_pred),
back_bic = model_prediction_data(model_back_bic_pred),
fwd_aic = model_prediction_data(model_fwd_aic_pred),
fwd_bic = model_prediction_data(model_fwd_bic_pred),
poly = model_prediction_data(model_poly_pred)
)
knitr::kable(df_result)
| rmse_test | rsme_train | |
|---|---|---|
| additive | 3.5654 | 3.6408 |
| best_aic | 3.5278 | 3.7212 |
| best_bic | 3.5278 | 3.7212 |
| back_aic | 3.5793 | 3.6424 |
| back_bic | 3.6147 | 3.6642 |
| fwd_aic | 3.5793 | 3.6424 |
| fwd_bic | 3.9932 | 3.8947 |
| poly | 4.3949 | 3.521 |
The best RMSE for both train and test data sets shows polynomial model
testing = predict(model_poly_pred,life_data_clean_tst)
#compare actual and predicted values of Life_expectancy
df_pred = head(cbind(actual = life_data_clean_tst$Life_expectancy,predicted = testing))
knitr::kable(df_pred,row.names = FALSE,caption = "Actual Vs Predicted")
| actual | predicted |
|---|---|
| 58.1 | 60.2358 |
| 57.3 | 59.4986 |
| 57.3 | 59.2965 |
| 54.8 | 56.8075 |
| 75.3 | 74.8445 |
| 75.9 | 73.9744 |
As we concluded above, we will use polynomial model going forward. Below is the formula of the model selected.
summary(model_poly)
##
## Call:
## lm(formula = Life_expectancy ~ sqrt(Adult_Mortality) * (Income_composition_of_resources^2) *
## Percentage_expenditure * Schooling * BMI + HIV_AIDS + Diphtheria +
## Polio + Year * Status, data = life_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.078 -2.039 0.039 2.224 12.375
##
## Coefficients:
## Estimate
## (Intercept) -3.27e+02
## sqrt(Adult_Mortality) 1.75e+00
## Income_composition_of_resources 4.75e+01
## Percentage_expenditure 2.17e-01
## Schooling 3.89e+00
## BMI 7.02e-01
## HIV_AIDS -5.28e-01
## Diphtheria 1.64e-02
## Polio 5.28e-03
## Year 1.72e-01
## StatusDeveloping 7.01e+02
## sqrt(Adult_Mortality):Income_composition_of_resources -2.92e+00
## sqrt(Adult_Mortality):Percentage_expenditure -1.32e-02
## Income_composition_of_resources:Percentage_expenditure -2.15e-01
## sqrt(Adult_Mortality):Schooling -1.93e-01
## Income_composition_of_resources:Schooling -3.00e+00
## Percentage_expenditure:Schooling -1.56e-02
## sqrt(Adult_Mortality):BMI -1.67e-02
## Income_composition_of_resources:BMI -6.97e-01
## Percentage_expenditure:BMI -4.14e-03
## Schooling:BMI -5.97e-02
## Year:StatusDeveloping -3.49e-01
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure 1.16e-02
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling 2.54e-01
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling 1.05e-03
## Income_composition_of_resources:Percentage_expenditure:Schooling 1.58e-02
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI 3.34e-02
## sqrt(Adult_Mortality):Percentage_expenditure:BMI 2.90e-04
## Income_composition_of_resources:Percentage_expenditure:BMI 4.15e-03
## sqrt(Adult_Mortality):Schooling:BMI 1.80e-03
## Income_composition_of_resources:Schooling:BMI 6.13e-02
## Percentage_expenditure:Schooling:BMI 3.07e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling -9.84e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI -2.68e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI -3.01e-03
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI -2.28e-05
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI -3.14e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 2.21e-05
## Std. Error
## (Intercept) 1.12e+02
## sqrt(Adult_Mortality) 5.28e-01
## Income_composition_of_resources 1.42e+01
## Percentage_expenditure 5.10e-02
## Schooling 7.19e-01
## BMI 2.67e-01
## HIV_AIDS 1.71e-02
## Diphtheria 5.31e-03
## Polio 5.10e-03
## Year 5.56e-02
## StatusDeveloping 1.22e+02
## sqrt(Adult_Mortality):Income_composition_of_resources 1.06e+00
## sqrt(Adult_Mortality):Percentage_expenditure 3.98e-03
## Income_composition_of_resources:Percentage_expenditure 6.00e-02
## sqrt(Adult_Mortality):Schooling 5.10e-02
## Income_composition_of_resources:Schooling 1.12e+00
## Percentage_expenditure:Schooling 3.78e-03
## sqrt(Adult_Mortality):BMI 2.05e-02
## Income_composition_of_resources:BMI 4.29e-01
## Percentage_expenditure:BMI 1.00e-03
## Schooling:BMI 2.11e-02
## Year:StatusDeveloping 6.06e-02
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure 4.94e-03
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling 8.80e-02
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling 3.17e-04
## Income_composition_of_resources:Percentage_expenditure:Schooling 4.41e-03
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI 3.46e-02
## sqrt(Adult_Mortality):Percentage_expenditure:BMI 8.24e-05
## Income_composition_of_resources:Percentage_expenditure:BMI 1.12e-03
## sqrt(Adult_Mortality):Schooling:BMI 1.70e-03
## Income_composition_of_resources:Schooling:BMI 3.12e-02
## Percentage_expenditure:Schooling:BMI 7.15e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling 3.82e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI 9.35e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI 2.65e-03
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI 6.30e-06
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 7.96e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 7.04e-06
## t value
## (Intercept) -2.93
## sqrt(Adult_Mortality) 3.31
## Income_composition_of_resources 3.34
## Percentage_expenditure 4.25
## Schooling 5.41
## BMI 2.63
## HIV_AIDS -30.86
## Diphtheria 3.08
## Polio 1.04
## Year 3.10
## StatusDeveloping 5.76
## sqrt(Adult_Mortality):Income_composition_of_resources -2.74
## sqrt(Adult_Mortality):Percentage_expenditure -3.32
## Income_composition_of_resources:Percentage_expenditure -3.59
## sqrt(Adult_Mortality):Schooling -3.79
## Income_composition_of_resources:Schooling -2.68
## Percentage_expenditure:Schooling -4.13
## sqrt(Adult_Mortality):BMI -0.81
## Income_composition_of_resources:BMI -1.62
## Percentage_expenditure:BMI -4.13
## Schooling:BMI -2.83
## Year:StatusDeveloping -5.76
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure 2.35
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling 2.88
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling 3.32
## Income_composition_of_resources:Percentage_expenditure:Schooling 3.58
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI 0.96
## sqrt(Adult_Mortality):Percentage_expenditure:BMI 3.52
## Income_composition_of_resources:Percentage_expenditure:BMI 3.71
## sqrt(Adult_Mortality):Schooling:BMI 1.06
## Income_composition_of_resources:Schooling:BMI 1.96
## Percentage_expenditure:Schooling:BMI 4.29
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling -2.58
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI -2.86
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI -1.14
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI -3.62
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI -3.94
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 3.14
## Pr(>|t|)
## (Intercept) 0.00343
## sqrt(Adult_Mortality) 0.00096
## Income_composition_of_resources 0.00085
## Percentage_expenditure 2.3e-05
## Schooling 7.4e-08
## BMI 0.00864
## HIV_AIDS < 2e-16
## Diphtheria 0.00210
## Polio 0.30028
## Year 0.00199
## StatusDeveloping 1.0e-08
## sqrt(Adult_Mortality):Income_composition_of_resources 0.00614
## sqrt(Adult_Mortality):Percentage_expenditure 0.00093
## Income_composition_of_resources:Percentage_expenditure 0.00035
## sqrt(Adult_Mortality):Schooling 0.00016
## Income_composition_of_resources:Schooling 0.00747
## Percentage_expenditure:Schooling 3.8e-05
## sqrt(Adult_Mortality):BMI 0.41624
## Income_composition_of_resources:BMI 0.10449
## Percentage_expenditure:BMI 3.8e-05
## Schooling:BMI 0.00471
## Year:StatusDeveloping 9.9e-09
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure 0.01876
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling 0.00398
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling 0.00092
## Income_composition_of_resources:Percentage_expenditure:Schooling 0.00036
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI 0.33487
## sqrt(Adult_Mortality):Percentage_expenditure:BMI 0.00044
## Income_composition_of_resources:Percentage_expenditure:BMI 0.00022
## sqrt(Adult_Mortality):Schooling:BMI 0.29058
## Income_composition_of_resources:Schooling:BMI 0.04960
## Percentage_expenditure:Schooling:BMI 1.9e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling 0.01004
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI 0.00425
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI 0.25645
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI 0.00030
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 8.5e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 0.00175
##
## (Intercept) **
## sqrt(Adult_Mortality) ***
## Income_composition_of_resources ***
## Percentage_expenditure ***
## Schooling ***
## BMI **
## HIV_AIDS ***
## Diphtheria **
## Polio
## Year **
## StatusDeveloping ***
## sqrt(Adult_Mortality):Income_composition_of_resources **
## sqrt(Adult_Mortality):Percentage_expenditure ***
## Income_composition_of_resources:Percentage_expenditure ***
## sqrt(Adult_Mortality):Schooling ***
## Income_composition_of_resources:Schooling **
## Percentage_expenditure:Schooling ***
## sqrt(Adult_Mortality):BMI
## Income_composition_of_resources:BMI
## Percentage_expenditure:BMI ***
## Schooling:BMI **
## Year:StatusDeveloping ***
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure *
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling **
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling ***
## Income_composition_of_resources:Percentage_expenditure:Schooling ***
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI
## sqrt(Adult_Mortality):Percentage_expenditure:BMI ***
## Income_composition_of_resources:Percentage_expenditure:BMI ***
## sqrt(Adult_Mortality):Schooling:BMI
## Income_composition_of_resources:Schooling:BMI *
## Percentage_expenditure:Schooling:BMI ***
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling *
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI **
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI ***
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI ***
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.57 on 1611 degrees of freedom
## Multiple R-squared: 0.839, Adjusted R-squared: 0.835
## F-statistic: 227 on 37 and 1611 DF, p-value: <2e-16
Looking at the diagnostics, we will check can our model still do better. We will now look at the individual significant of the parameters of this model to see if we can eliminate any predictors
names(coef(model_poly))
## [1] "(Intercept)"
## [2] "sqrt(Adult_Mortality)"
## [3] "Income_composition_of_resources"
## [4] "Percentage_expenditure"
## [5] "Schooling"
## [6] "BMI"
## [7] "HIV_AIDS"
## [8] "Diphtheria"
## [9] "Polio"
## [10] "Year"
## [11] "StatusDeveloping"
## [12] "sqrt(Adult_Mortality):Income_composition_of_resources"
## [13] "sqrt(Adult_Mortality):Percentage_expenditure"
## [14] "Income_composition_of_resources:Percentage_expenditure"
## [15] "sqrt(Adult_Mortality):Schooling"
## [16] "Income_composition_of_resources:Schooling"
## [17] "Percentage_expenditure:Schooling"
## [18] "sqrt(Adult_Mortality):BMI"
## [19] "Income_composition_of_resources:BMI"
## [20] "Percentage_expenditure:BMI"
## [21] "Schooling:BMI"
## [22] "Year:StatusDeveloping"
## [23] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure"
## [24] "sqrt(Adult_Mortality):Income_composition_of_resources:Schooling"
## [25] "sqrt(Adult_Mortality):Percentage_expenditure:Schooling"
## [26] "Income_composition_of_resources:Percentage_expenditure:Schooling"
## [27] "sqrt(Adult_Mortality):Income_composition_of_resources:BMI"
## [28] "sqrt(Adult_Mortality):Percentage_expenditure:BMI"
## [29] "Income_composition_of_resources:Percentage_expenditure:BMI"
## [30] "sqrt(Adult_Mortality):Schooling:BMI"
## [31] "Income_composition_of_resources:Schooling:BMI"
## [32] "Percentage_expenditure:Schooling:BMI"
## [33] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling"
## [34] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI"
## [35] "sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI"
## [36] "sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI"
## [37] "Income_composition_of_resources:Percentage_expenditure:Schooling:BMI"
## [38] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI"
The above are all the coefficients of the model. We will use them to compare to the below filtered list of p-values > .1
a <- coef(summary(model_poly))[,"Pr(>|t|)"]
names(a[a>.01])
## [1] "Polio"
## [2] "sqrt(Adult_Mortality):BMI"
## [3] "Income_composition_of_resources:BMI"
## [4] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure"
## [5] "sqrt(Adult_Mortality):Income_composition_of_resources:BMI"
## [6] "sqrt(Adult_Mortality):Schooling:BMI"
## [7] "Income_composition_of_resources:Schooling:BMI"
## [8] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling"
## [9] "sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI"
All parameters are significant
We look at variance inflation factors, and filter by only vifs that are >5
car::vif(model_poly)
## sqrt(Adult_Mortality)
## 8.91090e+02
## Income_composition_of_resources
## 8.74100e+02
## Percentage_expenditure
## 1.04136e+06
## Schooling
## 5.21921e+02
## BMI
## 3.59141e+03
## HIV_AIDS
## 1.37393e+00
## Diphtheria
## 1.69488e+00
## Polio
## 1.69129e+00
## Year
## 6.66083e+00
## Status
## 2.39484e+05
## sqrt(Adult_Mortality):Income_composition_of_resources
## 1.24285e+03
## sqrt(Adult_Mortality):Percentage_expenditure
## 4.03822e+05
## Income_composition_of_resources:Percentage_expenditure
## 1.14009e+06
## sqrt(Adult_Mortality):Schooling
## 9.94743e+02
## Income_composition_of_resources:Schooling
## 2.24803e+03
## Percentage_expenditure:Schooling
## 1.45994e+06
## sqrt(Adult_Mortality):BMI
## 3.21407e+03
## Income_composition_of_resources:BMI
## 7.19492e+03
## Percentage_expenditure:BMI
## 1.26814e+06
## Schooling:BMI
## 5.84200e+03
## Year:Status
## 2.39630e+05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure
## 4.86486e+05
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling
## 1.85237e+03
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling
## 6.41285e+05
## Income_composition_of_resources:Percentage_expenditure:Schooling
## 1.58696e+06
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI
## 5.47005e+03
## sqrt(Adult_Mortality):Percentage_expenditure:BMI
## 5.33188e+05
## Income_composition_of_resources:Percentage_expenditure:BMI
## 1.25602e+06
## sqrt(Adult_Mortality):Schooling:BMI
## 4.59203e+03
## Income_composition_of_resources:Schooling:BMI
## 9.84693e+03
## Percentage_expenditure:Schooling:BMI
## 1.72039e+06
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling
## 7.33117e+05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI
## 5.36506e+05
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI
## 7.35438e+03
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI
## 8.10010e+05
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI
## 1.70870e+06
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI
## 8.01228e+05
We notice that there are no high vif values so we make no changes
We will now look at high influence points and investigate them
As an experiment we try and remove the influentials and see what impact this has on the diagnostics
sum(cooks.distance(model_poly) > 4 / length(cooks.distance(model_poly)))
## [1] 101
inf_points_removed = subset(life_data_clean, cooks.distance(model_poly) <= 4 / length(cooks.distance(model_poly)))
model_poly_inf_removed = lm(Life_expectancy~sqrt(Adult_Mortality)*(Income_composition_of_resources^2)*Percentage_expenditure*Schooling*BMI+HIV_AIDS+Diphtheria+Polio+Year*Status,data=inf_points_removed)
Now we compare the diagnostics data
model_with_influential = diagnostics(model_poly)
model_without_influential = diagnostics(model_poly_inf_removed)
df_result = rbind(model_with_influential=model_with_influential, model_without_influential=model_without_influential)
knitr::kable(df_result, align = "cccccccc", caption = "Model Comparison with and without Influential datapoints")
| num_predictors | sw_pvalue | sw_decision | bp_pvalue | bp_decision | RMSE | AdjustedR2 | AIC | |
|---|---|---|---|---|---|---|---|---|
| model_with_influential | 37 | 6.78173115056804e-08 | Reject | 6.79157135486358e-17 | Reject | 3.5305 | 0.83513 | 4236.26 |
| model_without_influential | 37 | 0.0000549230744637251 | Reject | 0.0107460089192416 | Fail to Reject. | 2.9627 | 0.87119 | 3438.56 |
We see that our diagnostics have improved significantly, including BP test and saphiro test p-values, lower RMSE and higher adj R2.
We will have to sacrifice about 9% of the observations but the improvements in RMSE and Adjusted Rsquare are significant.
So now we know that it is the influential points that are causing our model to have less than ideal diagnostics and hence we will discard the influential points and select the new model as our better model
Hence now our good model is model_without_influential and the dataset is inf_points_removed
We have already seen this in various places but we will now compare the diagnostics of all the models that we have seen to see how we have progressed
| num_predictors | sw_pvalue | sw_decision | bp_pvalue | bp_decision | RMSE | AdjustedR2 | AIC | |
|---|---|---|---|---|---|---|---|---|
| m_additive | 17 | 5.80987998863635e-08 | Reject | 6.41579265092972e-33 | Reject | 3.6097 | 0.82976 | 4269.41 |
| m_transform_1 | 12 | 5.58980035080847e-15 | Reject | 3.40055212895347e-86 | Reject | 4.8562 | 0.71269 | 21269.62 |
| m_transform_1_step | 12 | 5.58980035080847e-15 | Reject | 3.40055212895347e-86 | Reject | 4.8573 | 0.71269 | 21269.62 |
| model_min_aic | 8 | 6.15431590889256e-10 | Reject | 3.4663551604506e-29 | Reject | 3.6678 | 0.8252 | 4304.07 |
| model_min_bic | 8 | 6.15431590889256e-10 | Reject | 3.46635516045084e-29 | Reject | 3.6678 | 0.8252 | 4304.07 |
| model_back_aic | 14 | 4.22813740389515e-08 | Reject | 1.3205694135825e-34 | Reject | 3.6138 | 0.82969 | 4267.14 |
| model_back_bic | 10 | 2.23394873540412e-08 | Reject | 4.56792467310997e-35 | Reject | 3.6319 | 0.8284 | 4275.64 |
| model_fwd_aic | 13 | 2.64491669882093e-08 | Reject | 1.37799693326893e-34 | Reject | 3.617 | 0.82949 | 4268.02 |
| model_fwd_bic | 10 | 5.30301565996376e-13 | Reject | 1.60142326938387e-81 | Reject | 3.8867 | 0.80348 | 4499.21 |
| model_poly | 37 | 6.78173115056804e-08 | Reject | 6.79157135486358e-17 | Reject | 3.5305 | 0.83513 | 4236.26 |
| model_without_influential | 37 | 0.0000549230744637251 | Reject | 0.0107460089192416 | Fail to Reject. | 2.9627 | 0.87119 | 3438.56 |
We also look at the diagnostics plot of our selected model
Our diagnostics plots look fairly good. Using an \(\alpha = .01\), even BP test validate the equal variance assumptions. Hence we will select model_ploy as our final model.
We spent significant time inspecting, cleaning the data and identifying the best possible model. Majorly time was spent in trying various predictors, their combinations and possible transformations. We looked at variance inflation factors and removed variables with high vifs. Visual pair plots, collinearity and partial correlation coefficient mechanism was applied repetitively to identify collinearity in regression models.
We also used boxcox to identify possible response and predictor transformations needed.
Multiple model selection methodology is applied including but not limited to simple additive, step-wise forward and backward using AIC & BIC, interactive, polynomial and their combination.
We spent a significant time looking at the individual significance test and experimenting with the variables there. This analysis had a limited benefit, and resulted in only making our diagnostic parameters worse. We still picked the large interactive polynomial model satisfying all major criteria such as train and test RMSE, adjusted \(R^2\) and passing bp test. We also run anova test on top three models
We then investigated the result of removing influential observations, this had some improvement on the diagnostics, and it seemed like a good compromise
After looking at the diagnostics plot we come to the final conclusion that we have a good enough model for identifying the key predictors for explaining the model and predicting with reasonable accuracy.
As we can see from the diagnostic statistics above, the model is useful since it has a fairly low RMSE with a very high adjusted \(R^2\) and AIC is also very low.
Hence we can conclude that we have a fairly good model.
Final Note: Contrary to our initial assumption, Life expectancy data analysis and designing a useful model was not straight-forward and was a challenging task. After trying several different combinations our team came up with a complex model. The combination of predictor interaction and transformation makes our model slightly complex but variable selection and their interpretation looks relevant for Life Expectancy. So we conclude that the model we have selected is good for prediction.
| Name | NetID | |
|---|---|---|
| Boris Tsekinovsky | borist3 | borist3@illinois.edu |
| Sid Rathi | rathi9 | rathi9@illinois.edu |
| Suraj Bisht | surajb2 | surajb2@illinois.edu |
#histograms
hist(inf_points_removed$BMI,main="Distribution of BMI",xlab = "BMI",ylab="Count", col = "blue",ylim=c(0,300))
hist(inf_points_removed$Hepatitis_B, main="Distribution of Hepatitis B",xlab = "Hepatitis B",ylab="Count", col = "blue")
hist(inf_points_removed$Measles,main="Distribution of Measles",xlab = "Measles",ylab="Count", col = "blue")
hist(inf_points_removed$Polio,main="Distribution of Polio",xlab = "Polio",ylab="Count", col = "blue")
hist(inf_points_removed$Diphtheria,main="Distribution of Diphtheria",xlab = "Diphtheria",ylab="Count", col = "blue")
hist(inf_points_removed$Adult_Mortality,main="Distribution of Adult Mortality",xlab = "Adult Mortality",ylab="Count", col = "blue")
hist(inf_points_removed$Alcohol,main="Distribution of Alcohol",xlab = "Alcohol",ylab="Count", col = "blue")
hist(inf_points_removed$HIV_AIDS,main="Distribution of HIV AIDS",xlab = "HIV AIDS",ylab="Count", col = "blue")
hist(inf_points_removed$Percentage_expenditure,main="Distribution of Percentage Expenditure",xlab = "Percentage Expenditure",ylab="Count", col = "blue")
hist(inf_points_removed$Life_expectancy,main="Distribution of Life Expectancy",xlab = "Life Expectancy", ylab="Count", col = "blue")
hist(inf_points_removed$Under_5_deaths,main="Distribution of Under Five Deaths ",xlab = "Under Five Deaths",ylab="Count", col = "blue",ylim=c(0,2000))
hist(inf_points_removed$Thinness_1_to_19_yrs,main="Distribution of thinness 1-19",xlab = "Thinness 1-19",ylab="Count", col = "blue",ylim=c(0,300))
```